Introduction
Hello, and welcome to our first tutorial! In this tutorial, we will analyze data from Chicago Food Inspections for the years 2010-2019. We will visualize the data and try to build predictive models from it.
Here we load the data locally to be able to manipulate it and use it later.
chicago <- read_csv("/Users/ilanamakover/320/food-inspections.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Inspection ID` = col_double(),
## `License #` = col_double(),
## Zip = col_double(),
## `Inspection Date` = col_datetime(format = ""),
## Latitude = col_double(),
## Longitude = col_double(),
## `Historical Wards 2003-2015` = col_double(),
## `Zip Codes` = col_double(),
## `Community Areas` = col_double(),
## `Census Tracts` = col_double(),
## Wards = col_double()
## )
## See spec(...) for full column specifications.
Let’s take a look at a small portion of our data.
chicago[1:5, 1:22] %>%
mutate(Violations = substr(Violations, 1, 50))%>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Inspection ID | DBA Name | AKA Name | License # | Facility Type | Risk | Address | City | State | Zip | Inspection Date | Inspection Type | Results | Violations | Latitude | Longitude | Location | Historical Wards 2003-2015 | Zip Codes | Community Areas | Census Tracts | Wards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2288309 | BLACK EAGLE CLUB | BLACK EAGLE CLUB | 2653260 | Restaurant | Risk 1 (High) | 1938 W IRVING PARK RD | CHICAGO | IL | 60613 | 2019-05-07 | License | Fail | | 41.95428 | -87.67788 | {‘longitude’: ‘-87.67788273287823’, ‘latitude’: ‘41.95427796120616’} | 13 | 21186 | 46 | 622 | 18 |
| 2288284 | NIBBLES & NOSH/MIDNIGHT MAC & CHEESERIE/XO MARSHMA | NIBBLES & NOSH/MIDNIGHT MAC & CHEESERIE/XO MARSHMA | 2658934 | Restaurant | Risk 1 (High) | 6977-6981 N SHERIDAN RD | CHICAGO | IL | 60626 | 2019-05-07 | License | Pass w/ Conditions | | 42.00903 | -87.66189 | {‘longitude’: ‘-87.66188729620244’, ‘latitude’: ‘42.00902867704194’} | 3 | 21853 | 10 | 267 | 5 |
| 2288315 | ROBERT’S PIZZA AND DOUGH COMPANY | ROBERT’S PIZZA AND DOUGH COMPANY | 2617087 | Restaurant | Risk 1 (High) | 411 E ILLINOIS ST | CHICAGO | IL | 60611 | 2019-05-07 | License | Pass w/ Conditions | | 41.89095 | -87.61719 | {‘longitude’: ‘-87.61718975017543’, ‘latitude’: ‘41.89095012024265’} | 22 | 21182 | 37 | 534 | 36 |
| 2288303 | THE FAT SHALLOT | THE FAT SHALLOT | 2373569 | Mobile Food Preparer | Risk 2 (Medium) | 2300 S THROOP ST | CHICAGO | IL | 60608 | 2019-05-07 | License | Fail | | 41.85045 | -87.65880 | {‘longitude’: ‘-87.65879785567869’, ‘latitude’: ‘41.85045102427’} | 8 | 14920 | 33 | 126 | 26 |
| 2288325 | ROBERT’S PIZZA AND DOUGH COMPANY | ROBERT’S PIZZA AND DOUGH COMPANY | 2617143 | Restaurant | Risk 1 (High) | 411 E ILLINOIS ST | CHICAGO | IL | 60611 | 2019-05-07 | License | Pass w/ Conditions | NA | 41.89095 | -87.61719 | {‘longitude’: ‘-87.61718975017543’, ‘latitude’: ‘41.89095012024265’} | 22 | 21182 | 37 | 534 | 36 |
Okay, great. We have the names of the establishments and other relevant information about each inspection, including but not limited to the address, risk level, inspection result, and violations.
Data Cleaning
We are only interested in establishments whose inspection result is Pass, Pass with Conditions, or Fail, so we remove any record that does not fit these categories.
Data from 2019 is also removed: 2019 only has 5 months' worth of data, whereas each year from 2010 through 2018 has a full twelve months of data.
New columns are created to help with later computations. For example, a year column is added where we extract the year from the Inspection Date column.
A risk value column is added to discretize risk. The Risk column has 3 categories: “Risk 1 (High)”, “Risk 2 (Medium)”, and “Risk 3 (Low)”. We create a new column that converts these categories into numeric values: 3 represents Risk 1 (High), 2 represents Risk 2 (Medium), and 1 represents Risk 3 (Low).
A results value column is added to discretize results. We create a new column that converts the three inspection result categories into numeric values: if the establishment fails inspection, the score is 1; if it passes with conditions, the score is 2; and if it passes, the score is 3.
chicago <- chicago %>%
#keep only inspections with a usable result, a specific risk level, and a known facility type
filter(Results %in% c("Pass", "Pass w/ Conditions", "Fail") &
Risk != "All" & !is.na(`Facility Type`)) %>%
#indicator columns: did the inspection pass, and was it a passing bakery / grocery store?
mutate(Pass = ifelse(Results == "Pass" | Results =="Pass w/ Conditions", 1, 0))%>%
mutate(countBakeryPass = ifelse(`Facility Type` == "Bakery",
ifelse(Results != "Fail", 1, 0), 0))%>%
mutate(countGSPass = ifelse(`Facility Type` == "Grocery Store",
ifelse(Results != "Fail", 1, 0), 0))%>%
#grab the year of the inspection date
mutate(year = as.numeric(gsub("(^[0-9]{4})(-[0-9]{2}-)([0-9]{2})", "\\1", `Inspection Date`))) %>%
filter(year != 2019)%>%
#adding risk value
mutate(`Risk Value` = ifelse(Risk == "Risk 1 (High)", 3,
ifelse(Risk == "Risk 2 (Medium)", 2, 1))) %>%
#adding results value
mutate(`Results Value` = ifelse(Results == "Fail", 1,
ifelse(Results == "Pass w/ Conditions", 2, 3)))
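As an aside, the nested `ifelse()` calls and the regular expression above can be written more readably with `dplyr::case_when()` and `lubridate::year()`. This is only an equivalent sketch of the same recoding, assuming the lubridate package is installed; it is not part of the pipeline above.

```r
#equivalent sketch of the recoding above, using case_when() and lubridate
#(assumes the lubridate package is installed)
library(lubridate)

chicago_alt <- chicago %>%
  #year() reads the year straight from the datetime column
  mutate(year = year(`Inspection Date`)) %>%
  mutate(`Risk Value` = case_when(
    Risk == "Risk 1 (High)"   ~ 3,
    Risk == "Risk 2 (Medium)" ~ 2,
    TRUE                      ~ 1
  )) %>%
  mutate(`Results Value` = case_when(
    Results == "Fail"               ~ 1,
    Results == "Pass w/ Conditions" ~ 2,
    TRUE                            ~ 3
  ))
```

`case_when()` evaluates its conditions in order and stops at the first match, so it reads top-to-bottom like the prose description of the recoding.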
Here is a small snippet of what our new dataset looks like.
chicago[1:5, 1:28] %>%
mutate(Violations = substr(Violations, 1, 50))%>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Inspection ID | DBA Name | AKA Name | License # | Facility Type | Risk | Address | City | State | Zip | Inspection Date | Inspection Type | Results | Violations | Latitude | Longitude | Location | Historical Wards 2003-2015 | Zip Codes | Community Areas | Census Tracts | Wards | Pass | countBakeryPass | countGSPass | year | Risk Value | Results Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2243951 | DONGPO IMPRESSION | DONGPO IMPRESSION | 2551932 | Restaurant | Risk 1 (High) | 228 W CERMAK RD | CHICAGO | IL | 60616 | 2018-12-31 | Complaint | Fail | | 41.85294 | -87.63322 | {‘longitude’: ‘-87.6332244476211’, ‘latitude’: ‘41.85294483663146’} | 8 | 21194 | 35 | 3 | 26 | 0 | 0 | 0 | 2018 | 3 | 1 |
| 2243947 | DOMINO’S | DOMINO’S | 2637330 | Restaurant | Risk 2 (Medium) | 509 N ORLEANS ST | CHICAGO | IL | 60654 | 2018-12-31 | License | Pass | NA | 41.89108 | -87.63687 | {‘longitude’: ‘-87.63687267718794’, ‘latitude’: ‘41.8910754076385’} | 22 | 4446 | 37 | 652 | 36 | 1 | 0 | 0 | 2018 | 2 | 3 |
| 2243952 | RANALLIS | RANALLIS | 1738099 | Restaurant | Risk 1 (High) | 1508-1512 W BERWYN AVE | CHICAGO | IL | 60640 | 2018-12-31 | Canvass | Pass w/ Conditions | | 41.97812 | -87.66876 | {‘longitude’: ‘-87.6687558692362’, ‘latitude’: ‘41.97812335776488’} | 46 | 22616 | 76 | 564 | 24 | 1 | 0 | 0 | 2018 | 3 | 2 |
| 2243942 | DUNKIN DONUTS | DUNKIN DONUTS | 2536449 | Restaurant | Risk 2 (Medium) | 309 W CHICAGO AVE | CHICAGO | IL | 60654 | 2018-12-31 | Canvass | Pass w/ Conditions | | 41.89646 | -87.63610 | {‘longitude’: ‘-87.63609804227903’, ‘latitude’: ‘41.8964577550414’} | 22 | 4446 | 37 | 652 | 36 | 1 | 0 | 0 | 2018 | 2 | 2 |
| 2243937 | WALGREENS #2025 | WALGREENS #2025 | 18593 | Grocery Store | Risk 3 (Low) | 3000 S HALSTED ST | CHICAGO | IL | 60608 | 2018-12-31 | Canvass | Pass | | 41.83977 | -87.64647 | {‘longitude’: ‘-87.64646751112869’, ‘latitude’: ‘41.839771946583845’} | 26 | 14920 | 58 | 59 | 48 | 1 | 0 | 1 | 2018 | 1 | 3 |
Data Visualization
Below, we create an interactive map of Chicago. You will be able to scroll around the city and see which businesses pass or fail inspection, each business's risk level, where the businesses are located, and their names.
To get a better understanding of the data, we sample 2000 entries.
Here is a quick guide on how to interpret the map below:
Icon: Indicates Result
Pass - thumbs-up
Fail - thumbs-down
Pass with Conditions - check-circle
Marker color: Indicates Risk Value
High = 3 = red
Medium = 2 = blue
Low = 1 = white
The business name pops up when the icon is clicked.
The cluster colors bear no significance to the data.
#sampling 2000 entries
dat <- chicago %>%
filter(!is.na(`Longitude`)) %>%
sample_n(2000)
getResult <- function(dat) {
ifelse(dat$Results == "Pass", 'thumbs-up',
ifelse(dat$Results == "Fail", 'thumbs-down', 'check-circle'))
}
getRiskValue <- function(dat) {
ifelse(dat$`Risk Value` == 3, 'red',
ifelse(dat$`Risk Value` == 1, 'white', 'blue'))
}
#icon functions
icons <- awesomeIcons(
icon = getResult(dat),
#iconColor = getResultColor(dat),
markerColor = getRiskValue(dat),
library = 'fa'
)
leaflet(dat) %>%
addTiles() %>%
addAwesomeMarkers(~Longitude,
~Latitude,
icon = icons,
clusterOptions = markerClusterOptions(),
popup = ~(`DBA Name`))%>%addFullscreenControl()
Next, let's look at the overall distribution of inspection results with a simple bar chart.
chicago %>%
ggplot(aes(x=Results, fill = Results))+
geom_bar()
We can see that, overall, there are more passes than there are fails or passes with conditions. However, this chart aggregates all the results from 2010 - 2018.
To get a better understanding of how businesses perform over the years, let's break the results down by faceting on year. This produces a grid of nine small plots (similar to the one above), one for each year from 2010 to 2018!
chicago %>%
#treat each year (2010-2018) as a separate facet
mutate(discrete_period = factor(year)) %>%
ggplot(aes(x=Results, fill = Results)) +
geom_bar() +
facet_wrap(~discrete_period) +
theme(legend.position = "top")+
#add a title
ggtitle("Results vs Year")
We can now get a better idea of the number of passing businesses per year. As the plots above show, every year has more passing businesses than failing ones. We can also see that the number of passes with conditions increases across the years.
Now, let's look at how the risk levels (high, medium, and low) break down by year.
chicago %>%
#treat each year (2010-2018) as a separate facet
mutate(discrete_period = factor(year)) %>%
ggplot(aes(x=Risk, fill = Risk)) +
geom_bar() +
facet_wrap(~discrete_period) +
theme(legend.position = "top")+
ggtitle("Risk vs Year")
There seems to be no trend between risk and year.
Another way to visualize the passes (including passes with conditions) per year is a scatter plot in which each point represents one year, plotting that year's mean pass rate against its total number of passes.
chicago %>%
group_by(year)%>%
summarize(passMean = mean(Pass), total = sum(Pass))%>%
ggplot(aes(x=total, y = passMean))+
geom_point()+theme_minimal()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
From the graph above, the mean pass rate increases only minimally as the total number of passes per year grows.
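To put a number on how weak that trend is, we can compute the correlation between year and the mean pass rate directly. This is a quick sanity check on the claim above, not a formal test:

```r
#quick check: how strongly does the mean pass rate move with year?
pass_by_year <- chicago %>%
  group_by(year) %>%
  summarize(passMean = mean(Pass))

#a value near 0 means little linear association, near 1 a strong one
cor(pass_by_year$year, pass_by_year$passMean)
```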
Let’s look at some other information and see the distribution of the passing inspections, both passing and passing with conditions, of Bakeries in Chicago over time.
bakeries <-chicago %>%
filter(`Facility Type` == "Bakery")%>%
group_by(year)%>%
summarize(passBakeryNum = mean(countBakeryPass))%>%
ggplot(aes(x=year, y = passBakeryNum))+
geom_point()+theme_minimal()+geom_smooth()
bakeries
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
It seems that the proportion of bakeries passing is essentially unchanged from year to year: on average, the bakery pass rate is stagnant.
Since there does not appear to be a relationship between passing bakeries and year, we will try plotting the distribution of grocery stores passing inspections (which includes passing with conditions), over time.
chicago %>%
filter(`Facility Type` == "Grocery Store")%>%
group_by(year)%>%
summarize(passGSNum = mean(countGSPass))%>%
ggplot(aes(x=year, y = passGSNum))+
geom_point()+theme_minimal()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
There seems to be a relationship between grocery stores passing and time; however, it does not appear to be linear. We will test this by fitting a linear regression model to check whether the relationship is non-linear. If it turns out to be non-linear, we will explore a logistic regression model to see if we get better results!
Experiment Design and Hypothesis
Based on the Grocery Stores Passing vs Year graph, we suspect there is no linear relationship between grocery stores passing and year. A standard way to test this is hypothesis testing: we state a null hypothesis and check whether the data lets us reject it.
Our null hypothesis is that there is no linear relationship between grocery stores passing and year, i.e., that the slope on year is zero.
To test this hypothesis, we fit a linear model of grocery stores passing on year. Linear models are widely used in data analysis: they allow us to construct confidence intervals and perform hypothesis tests on the relationship between variables.
To restrict the test to grocery stores, we create a new dataset from the Chicago dataset, grouped by year, with a new attribute mean_pass: the mean of Pass per year. From this new dataset, we fit a linear regression of mean_pass on year.
stores_mean <- chicago %>%
filter(`Facility Type` == "Grocery Store") %>%
group_by(year) %>%
summarize(mean_pass = mean(Pass)) %>%
select(year, mean_pass)
mean_lm <- lm(mean_pass ~ year, data=stores_mean)
summary(mean_lm)
##
## Call:
## lm(formula = mean_pass ~ year, data = stores_mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05683 -0.01837 0.00114 0.03365 0.04454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.292775 9.515287 -1.502 0.177
## year 0.007460 0.004725 1.579 0.158
##
## Residual standard error: 0.0366 on 7 degrees of freedom
## Multiple R-squared: 0.2626, Adjusted R-squared: 0.1573
## F-statistic: 2.493 on 1 and 7 DF, p-value: 0.1583
Multiple R-squared evaluates how well the model fits the data: it is the fraction of variance in the predicted (dependent) variable that the independent variable explains. Our R-squared value of 0.2626 means that year explains only about 26% of the variation in the pass rate, and the p-value for the year slope (0.158) is well above 0.05. We therefore fail to reject the null hypothesis that the slope is zero: the data shows no evidence of a linear relationship between grocery stores passing and year.
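Rather than reading these numbers off the printed summary, they can also be extracted from the fitted model programmatically, which is less error-prone when the analysis is re-run:

```r
#pull goodness-of-fit numbers straight out of the fitted model
fit_summary <- summary(mean_lm)
fit_summary$r.squared                   #multiple R-squared, about 0.2626
fit_summary$adj.r.squared               #adjusted R-squared, about 0.1573
coef(fit_summary)["year", "Pr(>|t|)"]   #p-value for the year slope, about 0.158
```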
Let's check whether the residuals are skewed, that is, whether they are distributed asymmetrically around their median.
# Extract the quartiles of the residuals from the fitted linear model
quantiles <- quantile(mean_lm$residuals)
quantiles
## 0% 25% 50% 75% 100%
## -0.056827652 -0.018365843 0.001139828 0.033652606 0.044544230
To determine whether the residuals are skewed, we check whether the First Quartile (the 25% column) and the Third Quartile (the 75% column) are equally distant from the median. To do that, subtract the median from the third quartile and check whether the result equals the median minus the first quartile.
In short, we have to check:
75% column - median = median - 25% column
0.0336526 - 0.0011398 =? 0.0011398 - (-0.0183658)
0.0325128 ≠ 0.0195057
Since the two sides are not equal, the residuals are skewed.
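The same symmetry check can be done in code, directly from the residual quantiles:

```r
#compare the distance from the median to each quartile
q <- quantile(mean_lm$residuals)
upper_gap <- q[["75%"]] - q[["50%"]]  #about 0.0325
lower_gap <- q[["50%"]] - q[["25%"]]  #about 0.0195
upper_gap == lower_gap                #FALSE, so the residuals are skewed
```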
Let's construct a confidence interval so we can say how precise our estimate of the slope on year (B1) is. We will use the standard error estimate for B1 to construct an approximate 95% confidence interval.
#gather the stats from the linear_model using tidy() which is defined by the broom package.
stats <- mean_lm%>%
tidy() %>%
select(term, estimate, std.error)
stats
## # A tibble: 2 x 3
## term estimate std.error
## <chr> <dbl> <dbl>
## 1 (Intercept) -14.3 9.52
## 2 year 0.00746 0.00472
confidence_interval_offset <- 1.96 * stats$std.error[2]
confidence_interval <- round(c(stats$estimate[2] - confidence_interval_offset,
stats$estimate[2],
stats$estimate[2] + confidence_interval_offset), 4)
confidence_interval
## [1] -0.0018 0.0075 0.0167
The interval runs from -0.0018 to 0.0167: the estimated slope (0.0075) plus or minus roughly two standard errors. Because the interval contains zero, a slope of zero is consistent with the data, which again supports the conclusion that there is no linear relationship between grocery stores passing and year.
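As a cross-check, base R's confint() computes the same interval using the exact t-distribution critical value instead of a normal approximation. With only 7 degrees of freedom, the t-based interval is somewhat wider, but it still contains zero, so the conclusion is unchanged:

```r
#t-based 95% confidence interval for the year slope
confint(mean_lm, "year", level = 0.95)
#roughly -0.0037 to 0.0186: wider than the manual interval, still spanning zero
```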
mean_augment <- mean_lm %>%
augment()
mean_augment %>% head()
## # A tibble: 6 x 9
## mean_pass year .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.684 2010 0.703 0.0225 -0.0184 0.378 0.0384 0.123 -0.636
## 2 0.673 2011 0.710 0.0187 -0.0374 0.261 0.0353 0.250 -1.19
## 3 0.710 2012 0.717 0.0154 -0.00768 0.178 0.0394 0.00579 -0.231
## 4 0.759 2013 0.725 0.0131 0.0345 0.128 0.0365 0.0747 1.01
## 5 0.766 2014 0.732 0.0122 0.0337 0.111 0.0367 0.0595 0.975
## 6 0.784 2015 0.740 0.0131 0.0445 0.128 0.0344 0.124 1.30
In the residuals-vs-fitted plot below, we are looking for residuals scattered randomly around zero with no visible pattern; a curved or systematic trend would indicate that the linear model is missing structure in the data.
mean_augment %>%
ggplot(aes(x=.fitted,y=.resid)) +
geom_point() +
geom_smooth() +
labs(x="fitted", y="residual")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
As we can see above, there is no clear linear relationship between grocery stores passing and year. So we will try a logistic model using a few more facility types.
We create a new dataset called facilities, which only keeps information on bakeries, grocery stores, and restaurants.
facilities <- chicago %>%
select(year, `Facility Type`, Results, Pass)%>%
filter(`Facility Type` == "Bakery" | `Facility Type` == "Grocery Store" | `Facility Type` == "Restaurant")
Great! Now that we have grabbed those attributes, we want to know whether facility type and year are associated with passing results.
#Model passing as a function of year and facility type
logistic_model = glm(formula = Pass ~ year+`Facility Type`, data = facilities, family="binomial")
summary(logistic_model)
##
## Call:
## glm(formula = Pass ~ year + `Facility Type`, family = "binomial",
## data = facilities)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7990 0.6648 0.6817 0.7047 0.8142
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.51254 5.41767 -6.740 1.59e-11 ***
## year 0.01870 0.00269 6.951 3.64e-12 ***
## `Facility Type`Grocery Store -0.13362 0.05093 -2.624 0.008700 **
## `Facility Type`Restaurant 0.17998 0.04904 3.670 0.000243 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136310 on 129311 degrees of freedom
## Residual deviance: 135921 on 129308 degrees of freedom
## AIC: 135929
##
## Number of Fisher Scoring iterations: 4
#Now allow the per-year trend to differ by facility type by adding an interaction term
interactive_logistic_model = glm(formula = Pass ~ `Facility Type`*year, data = facilities, family="binomial")
summary(interactive_logistic_model)
##
## Call:
## glm(formula = Pass ~ `Facility Type` * year, family = "binomial",
## data = facilities)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7880 0.6722 0.6835 0.6987 0.8513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.318e+00 3.930e+01 0.059 0.9530
## `Facility Type`Grocery Store -9.903e+01 4.134e+01 -2.396 0.0166 *
## `Facility Type`Restaurant -2.605e+01 3.977e+01 -0.655 0.5124
## year -5.836e-04 1.951e-02 -0.030 0.9761
## `Facility Type`Grocery Store:year 4.911e-02 2.052e-02 2.393 0.0167 *
## `Facility Type`Restaurant:year 1.303e-02 1.975e-02 0.660 0.5095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136310 on 129311 degrees of freedom
## Residual deviance: 135893 on 129306 degrees of freedom
## AIC: 135905
##
## Number of Fisher Scoring iterations: 4
interactive_logistic_model_stats <- interactive_logistic_model %>% tidy()
interactive_logistic_model_stats
## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.32 39.3 0.0590 0.953
## 2 `Facility Type`Grocery Store -99.0 41.3 -2.40 0.0166
## 3 `Facility Type`Restaurant -26.1 39.8 -0.655 0.512
## 4 year -0.000584 0.0195 -0.0299 0.976
## 5 `Facility Type`Grocery Store:year 0.0491 0.0205 2.39 0.0167
## 6 `Facility Type`Restaurant:year 0.0130 0.0197 0.660 0.509
year<-interactive_logistic_model_stats$estimate[4]
estimate<- interactive_logistic_model_stats%>%
mutate(`Facility Type` = term, estimate = estimate)%>%
select(`Facility Type`, estimate)%>%
slice(5:6)
estimate_df <- data.frame(estimate)
estimate_df <- estimate_df %>%
mutate(increase = year+ estimate)
estimate_df
## Facility.Type estimate increase
## 1 `Facility Type`Grocery Store:year 0.0491134 0.04852979
## 2 `Facility Type`Restaurant:year 0.0130257 0.01244209
With bakeries as the baseline category, the estimated per-year slope for grocery stores is about 0.0485 (the year coefficient plus the grocery-store interaction) and about 0.0124 for restaurants, both on the log-odds scale. In other words, the odds of passing grow faster per year for grocery stores, and slightly faster for restaurants, than for bakeries.
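Because these coefficients are on the log-odds scale, exponentiating each per-year slope converts it into a multiplicative change in the odds of passing, which is easier to interpret:

```r
#convert per-year log-odds slopes into odds multipliers
year_slope <- interactive_logistic_model_stats$estimate[4]             #baseline (bakeries)
gs_slope   <- year_slope + interactive_logistic_model_stats$estimate[5]
rest_slope <- year_slope + interactive_logistic_model_stats$estimate[6]

exp(gs_slope)    #about 1.050: grocery stores' odds of passing grow roughly 5% per year
exp(rest_slope)  #about 1.013: restaurants' odds grow roughly 1.3% per year
```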
augmented_logistic <- logistic_model %>%
augment()
augmented_logistic %>% head()
## # A tibble: 6 x 10
## Pass year Facility.Type .fitted .se.fit .resid .hat .sigma .cooksd
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 2018 Restaurant 1.40 0.0130 -1.80 2.69e-5 1.03 2.72e-5
## 2 1 2018 Restaurant 1.40 0.0130 0.665 2.69e-5 1.03 1.66e-6
## 3 1 2018 Restaurant 1.40 0.0130 0.665 2.69e-5 1.03 1.66e-6
## 4 1 2018 Restaurant 1.40 0.0130 0.665 2.69e-5 1.03 1.66e-6
## 5 1 2018 Grocery Store 1.08 0.0196 0.763 7.29e-5 1.03 6.17e-6
## 6 1 2018 Restaurant 1.40 0.0130 0.665 2.69e-5 1.03 1.66e-6
## # … with 1 more variable: .std.resid <dbl>
augmented_logistic %>%
ggplot(aes(x=.fitted,y=.resid)) +
geom_point() +
geom_smooth() +
labs(x="fitted", y="residual")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'